Exploratory Data Analysis (EDA) helps to answer all these questions, ensuring the best outcomes for the project. It is an approach for summarizing, visualizing, and understanding the important characteristics of a data set.
Exploratory Data Analysis is valuable to data science projects because it brings you closer to the certainty that future results will be valid, correctly interpreted, and applicable to the desired business contexts. EDA also helps to find insights that were not evident, or did not seem worth investigating, to business stakeholders and data scientists but can be very informative about a particular business.
It is always better to explore each data set using multiple exploratory techniques and compare the results. The goal of this step is to become confident that the data set is ready to be used in a machine learning algorithm.
Exploratory Data Analysis is mainly performed using the following methods:
These methods help data scientists identify patterns and understand the problem.
In a hurry to get to the machine learning stage or simply to impress business stakeholders quickly, data scientists tend to either skip the exploratory process entirely or do very shallow work. It is a very serious and, sadly, common mistake of amateur data science consulting “professionals”.
Skipping this step can leave skewed data, outliers, and too many missing values undetected and, therefore, lead to some sad outcomes for the project:
Exploratory Data Analysis (EDA) is used, on the one hand, to answer questions, test business assumptions, and generate hypotheses for further analysis. On the other hand, you can also use it to prepare the data for modeling.
What these two uses have in common is that they require good knowledge of your data, either to get the answers that you need or to develop an intuition for interpreting the results of future modeling.
There are many ways to reach these goals. A typical sequence is as follows:
Import the data
Get a feel for the data: describe it and look at a sample, such as the first and last rows
Take a deeper look into the data by querying or indexing the data
Identify features of interest
Recognise the challenges posed by the data, such as missing values and outliers
Discover patterns in the data
An important part of EDA is data profiling.
Data profiling is concerned with summarizing your dataset through descriptive statistics. You want to use a variety of measurements to better understand your dataset. The goal of data profiling is to have a solid understanding of your data so you can afterwards start querying and visualizing your data in various ways. However, this doesn’t mean that you don’t have to iterate: exactly because data profiling is concerned with summarizing your dataset, it is frequently used to assess the data quality. Depending on the result of the data profiling, you might decide to correct, discard or handle your data differently.
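As a rough illustration of what such a profile can look like, the sketch below summarises each column of a small made-up DataFrame by dtype, missing count, and number of distinct values. The `toy` data and the `profile` helper are hypothetical names introduced only for this example; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0],
    "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave"],
})

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count/percentage, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })

summary = profile(toy)
print(summary)
```

A table like this is often enough to decide which columns need correcting, discarding, or special handling before any plotting starts.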
2 types of Data Analysis
Confirmatory Data Analysis
Exploratory Data Analysis
4 Objectives of EDA
Discover Patterns
Spot Anomalies
Frame Hypothesis
Check Assumptions
2 methods for exploration
Univariate Analysis
Bivariate Analysis
Stuff done during EDA
Trends
Distribution
Mean
Median
Outlier
Spread measurement (SD)
Correlations
Hypothesis testing
Visual Exploration
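To make the difference between univariate and bivariate analysis concrete, here is a minimal sketch on a few made-up rows. The `toy` values are illustrative only and should not be read as results for the real dataset.

```python
import pandas as pd

# Hypothetical toy rows; the column names follow the house price data.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786, 1717, 2198],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Univariate analysis: one variable at a time (centre and spread).
univariate = toy["SalePrice"].agg(["mean", "median", "std"])

# Bivariate analysis: how two variables move together.
corr = toy["GrLivArea"].corr(toy["SalePrice"])

print(univariate)
print("Pearson correlation:", corr)
```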
This is an exploratory data analysis on the House Prices Kaggle Competition found at
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
There are 1460 instances of training data and 1459 of test data. The training set has 81 attributes in total: 36 numerical and 43 categorical features, plus Id and SalePrice.
Numerical Features: 1stFlrSF, 2ndFlrSF, 3SsnPorch, BedroomAbvGr, BsmtFinSF1, BsmtFinSF2, BsmtFullBath, BsmtHalfBath, BsmtUnfSF, EnclosedPorch, Fireplaces, FullBath, GarageArea, GarageCars, GarageYrBlt, GrLivArea, HalfBath, KitchenAbvGr, LotArea, LotFrontage, LowQualFinSF, MSSubClass, MasVnrArea, MiscVal, MoSold, OpenPorchSF, OverallCond, OverallQual, PoolArea, ScreenPorch, TotRmsAbvGrd, TotalBsmtSF, WoodDeckSF, YearBuilt, YearRemodAdd, YrSold
Categorical Features: Alley, BldgType, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, BsmtQual, CentralAir, Condition1, Condition2, Electrical, ExterCond, ExterQual, Exterior1st, Exterior2nd, Fence, FireplaceQu, Foundation, Functional, GarageCond, GarageFinish, GarageQual, GarageType, Heating, HeatingQC, HouseStyle, KitchenQual, LandContour, LandSlope, LotConfig, LotShape, MSZoning, MasVnrType, MiscFeature, Neighborhood, PavedDrive, PoolQC, RoofMatl, RoofStyle, SaleCondition, SaleType, Street, Utilities
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
BedroomAbvGr: Number of bedrooms above basement level
KitchenAbvGr: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
Load the libraries that you expect to need; you can also load more as you go along.
!pip install missingno
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as st
from sklearn import ensemble, tree, linear_model
import missingno as msno
To start exploring your data, you first need to load it. Thanks to the Pandas library, this is an easy task: you import the package as pd, following the convention, and you use the read_csv() function, to which you pass the path or URL where the data can be found. By default, read_csv() treats the first row as the column names, which is what we want here because both files start with a header row. If a file has no header row, pass header=None so that the first row of your data won't be interpreted as the column names of your DataFrame.
There are also other arguments that you can specify to ensure that your data is read in correctly: the delimiter to use with the sep or delimiter arguments, the column names to use with names, or the column to use as the row labels for the resulting DataFrame with index_col.
train = pd.read_csv('train1.csv')
test = pd.read_csv('test1.csv')
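The read_csv() arguments mentioned above can be tried out without any file on disk. The sketch below reads a small semicolon-delimited string that has no header row; the `raw` content and the chosen column names are assumptions made for illustration and do not reflect how train1.csv is actually laid out.

```python
import io
import pandas as pd

# Hypothetical in-memory "file": semicolon-delimited, no header row.
raw = "1;60;208500\n2;20;181500\n3;60;223500\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                                   # delimiter used in the file
    header=None,                               # first row is data, not column names
    names=["Id", "MSSubClass", "SalePrice"],   # supply the column names ourselves
    index_col="Id",                            # use Id as the row labels
)
print(df)
```

For a file that already starts with a header row, as is the case here, the defaults are enough and none of these arguments is needed.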
One of the most elementary steps is getting a basic description of your data: a quick way to get some simple, easy-to-understand information and a basic feel for the data. We can use the describe() function to get various summary statistics; NaN values are excluded from the calculations.
train.describe()
| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.000000 | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | ... | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
| mean | 730.500000 | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.195890 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.000000 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 34900.000000 |
| 25% | 365.750000 | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 | 129975.000000 |
| 50% | 730.500000 | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | ... | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 163000.000000 |
| 75% | 1095.250000 | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | ... | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 214000.000000 |
| max | 1460.000000 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | ... | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 | 755000.000000 |
8 rows × 38 columns
Now that you have got a general idea about your data set, it’s also a good idea to take a closer look at the data itself. With the help of the head() and tail() functions of the Pandas library, you can easily check out the first and last lines of your DataFrame, respectively.
Let us look at some sample data
train.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
train.tail()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
| 1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |
5 rows × 81 columns
train.shape , test.shape
((1460, 81), (1459, 80))
Let us examine numerical features in the train dataset
numeric_features = train.select_dtypes(include=[np.number])
numeric_features.columns
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
dtype='object')
numeric_features.head()
| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | ... | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | 208500 |
| 1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | ... | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | 181500 |
| 2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | ... | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | 223500 |
| 3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | ... | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | 140000 |
| 4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | ... | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | 250000 |
5 rows × 38 columns
The dataset has 4 year variables. From such temporal variables we can extract information such as the number of years between two events, for example the age of the house when it was sold.
# list of variables that contain year information
year_feature = [feature for feature in numeric_features if 'Yr' in feature or 'Year' in feature]
year_feature
['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'YrSold']
# Let us explore the contents of temporal variables
for feature in year_feature:
    print(feature, train[feature].unique())
YearBuilt [2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 1965 2005 1962 2006 1960 1929 1970 1967 1958 1930 2002 1968 2007 1951 1957 1927 1920 1966 1959 1994 1954 1953 1955 1983 1975 1997 1934 1963 1981 1964 1999 1972 1921 1945 1982 1998 1956 1948 1910 1995 1991 2009 1950 1961 1977 1985 1979 1885 1919 1990 1969 1935 1988 1971 1952 1936 1923 1924 1984 1926 1940 1941 1987 1986 2008 1908 1892 1916 1932 1918 1912 1947 1925 1900 1980 1989 1992 1949 1880 1928 1978 1922 1996 2010 1946 1913 1937 1942 1938 1974 1893 1914 1906 1890 1898 1904 1882 1875 1911 1917 1872 1905] YearRemodAdd [2003 1976 2002 1970 2000 1995 2005 1973 1950 1965 2006 1962 2007 1960 2001 1967 2004 2008 1997 1959 1990 1955 1983 1980 1966 1963 1987 1964 1972 1996 1998 1989 1953 1956 1968 1981 1992 2009 1982 1961 1993 1999 1985 1979 1977 1969 1958 1991 1971 1952 1975 2010 1984 1986 1994 1988 1954 1957 1951 1978 1974] GarageYrBlt [2003. 1976. 2001. 1998. 2000. 1993. 2004. 1973. 1931. 1939. 1965. 2005. 1962. 2006. 1960. 1991. 1970. 1967. 1958. 1930. 2002. 1968. 2007. 2008. 1957. 1920. 1966. 1959. 1995. 1954. 1953. nan 1983. 1977. 1997. 1985. 1963. 1981. 1964. 1999. 1935. 1990. 1945. 1987. 1989. 1915. 1956. 1948. 1974. 2009. 1950. 1961. 1921. 1900. 1979. 1951. 1969. 1936. 1975. 1971. 1923. 1984. 1926. 1955. 1986. 1988. 1916. 1932. 1972. 1918. 1980. 1924. 1996. 1940. 1949. 1994. 1910. 1978. 1982. 1992. 1925. 1941. 2010. 1927. 1947. 1937. 1942. 1938. 1952. 1928. 1922. 1934. 1906. 1914. 1946. 1908. 1929. 1933.] YrSold [2008 2007 2006 2009 2010]
We will plot the difference between the year of sale and each of the other year features against SalePrice
## Compare the difference between YrSold and each other year feature with SalePrice
for feature in year_feature:
    if feature != 'YrSold':
        data = train.copy()
        ## Capture the number of years between the year variable and the year the house was sold
        data[feature] = data['YrSold'] - data[feature]
        plt.scatter(data[feature], data['SalePrice'])
        plt.xlabel(feature)
        plt.ylabel('SalePrice')
        plt.show()
# Same plots as above, with styled labels and titles
for feature in year_feature:
    if feature != 'YrSold':
        data = train.copy()
        ## Capture the number of years between the year variable and the year the house was sold
        data[feature] = data['YrSold'] - data[feature]
        plt.scatter(data[feature], data['SalePrice'])
        plt.xlabel(feature, fontsize=12, fontfamily='serif')
        plt.ylabel('SalePrice', fontsize=12, fontfamily='serif')
        plt.title(f"{feature} versus SalePrice", fontsize=12,
                  fontfamily='serif', fontweight='bold')
        sns.despine(top=True, right=True)
        plt.tick_params(which='both', axis='both', length=0)
        plt.show()
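The "years before sale" quantity used in the loops above can also be computed for all year columns at once, without copying the DataFrame on every iteration. This is only a sketch on a hypothetical `toy` frame; `year_cols` and `ages` are names made up for the example.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the year features.
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "YearRemodAdd": [2003, 1976, 1970],
    "YrSold": [2008, 2007, 2006],
})

year_cols = ["YearBuilt", "YearRemodAdd"]

# rsub computes "other - self", so each column becomes YrSold minus that column,
# aligned row by row (axis=0).
ages = toy[year_cols].rsub(toy["YrSold"], axis=0)
print(ages)
```

Each column of `ages` can then be plotted against SalePrice in the same way as in the loops above.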
Note: The code below is mainly a convenience for visualisation and has little relevance to building machine learning models. The condition len(train[feature].unique()) < 25 collects the numerical features that have fewer than 25 unique values. A feature with, say, 100 unique values would produce 100 bars, and there is hardly anything we could infer from such a plot. As a rule of thumb, people therefore use a cut-off of 25 or a similar number.
discrete_feature=[feature for feature in numeric_features if len(train[feature].unique())<25 and feature not in year_feature + ['Id']]
print("Discrete Variables Count: {}".format(len(discrete_feature)))
"""
Understanding the code above: it is a list comprehension.
year_feature (defined in the temporal variables cells above) is the list of all features that
contain 'Yr' or 'Year' in their name, for example 'YearBuilt' and 'GarageYrBlt'.
numeric_features holds all the numerical features (defined just above the temporal variables cells).
In the discrete_feature expression:
- train[feature].unique() returns all the unique values of that feature.
- len(train[feature].unique()) gives the number of unique values of that feature.
Explaining the code:
For each numerical feature, it checks that the feature has fewer than 25 unique values and that it is not in year_feature + ['Id']
(the year features plus 'Id'). If both conditions are satisfied, the feature
is included in discrete_feature.
Result:
There are 17 discrete variables that do not have 'Yr' or 'Year' in their name and have fewer than 25 unique values.
All the discrete features are shown below.
"""
Discrete Variables Count: 17
train[discrete_feature].head()
| | MSSubClass | OverallQual | OverallCond | LowQualFinSF | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageCars | 3SsnPorch | PoolArea | MiscVal | MoSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 7 | 5 | 0 | 1 | 0 | 2 | 1 | 3 | 1 | 8 | 0 | 2 | 0 | 0 | 0 | 2 |
| 1 | 20 | 6 | 8 | 0 | 0 | 1 | 2 | 0 | 3 | 1 | 6 | 1 | 2 | 0 | 0 | 0 | 5 |
| 2 | 60 | 7 | 5 | 0 | 1 | 0 | 2 | 1 | 3 | 1 | 6 | 1 | 2 | 0 | 0 | 0 | 9 |
| 3 | 70 | 7 | 5 | 0 | 1 | 0 | 1 | 0 | 3 | 1 | 7 | 1 | 3 | 0 | 0 | 0 | 2 |
| 4 | 60 | 8 | 5 | 0 | 1 | 0 | 2 | 1 | 4 | 1 | 9 | 1 | 3 | 0 | 0 | 0 | 12 |
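The same threshold-based split can be written with nunique(), which counts distinct values for all columns in one call. The sketch below uses a hypothetical `toy` frame; note that nunique() ignores NaN by default, whereas len(unique()) counts NaN as a value, so the two can differ by one for columns with missing data.

```python
import pandas as pd

# Hypothetical toy data: one low-cardinality and one high-cardinality column.
toy = pd.DataFrame({
    "Id": range(1, 41),
    "OverallQual": [i % 10 + 1 for i in range(40)],   # 10 distinct values
    "LotArea": [8000 + 37 * i for i in range(40)],    # 40 distinct values
    "YrSold": [2006 + i % 5 for i in range(40)],
})

excluded = ["Id", "YrSold"]   # identifier and year columns are handled separately
threshold = 25                # same cut-off as in the text above

n_unique = toy.drop(columns=excluded).nunique()
discrete = n_unique[n_unique < threshold].index.tolist()
continuous = n_unique[n_unique >= threshold].index.tolist()

print("discrete:", discrete)
print("continuous:", continuous)
```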
for feature in discrete_feature:
    data = train.copy()
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature, fontsize=12, fontfamily='serif')
    plt.ylabel('SalePrice', fontsize=12, fontfamily='serif')
    plt.title(feature, fontsize=12, fontfamily='serif', fontweight='bold')
    sns.despine(top=True, left=True, right=True, bottom=True)
    plt.tick_params(which='both', axis='both', length=0)
    # Show the plot
    plt.show()
continuous_feature=[feature for feature in numeric_features if feature not in discrete_feature+year_feature+['Id']]
print("Continuous Feature Count {}".format(len(continuous_feature)))
Continuous Feature Count 16
Let us analyse the continuous values with data visualisation to understand the data distribution
for feature in continuous_feature:
    data = train.copy()
    data[feature].hist(bins=25, grid=False)
    plt.xlabel(feature, fontsize=12, fontfamily='serif')
    plt.ylabel("Count", fontsize=12, fontfamily='serif')
    plt.title(feature, fontsize=15, fontfamily='serif', fontweight='bold')
    sns.despine(top=True, left=True, right=True, bottom=True)
    plt.tick_params(which='both', axis='both', length=0)
    # Show the plot
    plt.show()
categorical_features = train.select_dtypes(include=[object])
categorical_features.columns
Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
'SaleType', 'SaleCondition'],
dtype='object')
Visualising missing values for a sample of 250
msno.matrix(train.sample(250))
plt.show()
The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:
msno.heatmap(train)
plt.show()
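To see what nullity correlation means numerically, it can be approximated by hand: turn every column into a 0/1 "is missing" indicator and correlate the indicators. This is a sketch under the assumption that the heatmap is based on this kind of pairwise correlation; the `toy` data is made up, and the exact preprocessing inside missingno may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the two garage columns are always missing together.
toy = pd.DataFrame({
    "GarageType":  ["Attchd", np.nan, "Detchd", np.nan, "Attchd", "Attchd"],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan, 2000.0, 1993.0],
    "Alley":       [np.nan, "Grvl", np.nan, np.nan, "Pave", np.nan],
    "LotArea":     [8450, 9600, 11250, 9550, 14260, 14115],
})

is_null = toy.isnull()

# Columns that are always present (or always missing) have no variance,
# so their correlation is undefined; drop them first.
is_null = is_null.loc[:, is_null.nunique() > 1]

nullity_corr = is_null.astype(int).corr()
print(nullity_corr)
```

A value of 1 means two columns are always missing together, a value of -1 means one is present exactly when the other is missing, and values near 0 mean the missingness of one says little about the other.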
msno.bar(train.sample(1000))
plt.show()
The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap:
msno.dendrogram(train)
plt.show()
The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.
To interpret this graph, read it from a top-down perspective. Cluster leaves which are linked together at a distance of zero fully predict one another's presence: one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.
Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually match, or ought to match, each other in nullity, then the height of the cluster leaf tells you, in absolute terms, how often the records are "mismatched" or incorrectly filed, that is, how many values you would have to fill in or drop, if you are so inclined.
As with matrix, only up to 50 labeled columns will comfortably display in this configuration. However the dendrogram more elegantly handles extremely large datasets by simply flipping to a horizontal configuration.
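The clustering idea behind the dendrogram can be sketched directly with scipy: each column becomes a 0/1 missingness vector, and columns with similar vectors are merged first. The linkage method ("average") and the default Euclidean metric used here are assumptions chosen for illustration and may not match missingno's internals; the `toy` data is made up.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# Hypothetical toy data: the two garage columns share the same missingness pattern.
toy = pd.DataFrame({
    "GarageType":  ["Attchd", np.nan, "Detchd", np.nan, "Attchd", "Attchd"],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan, 2000.0, 1993.0],
    "Alley":       [np.nan, "Grvl", np.nan, np.nan, "Pave", np.nan],
    "LotArea":     [8450, 9600, 11250, 9550, 14260, 14115],
})

# One row per column of the data, one 0/1 entry per record.
null_matrix = toy.isnull().astype(int).T.values

# Agglomerative clustering of the columns by their missingness vectors.
Z = hierarchy.linkage(null_matrix, method="average")

# Build the tree structure without drawing it.
tree = hierarchy.dendrogram(Z, no_plot=True, labels=toy.columns.tolist())

print(Z)            # each row: the two clusters merged, their distance, the new cluster size
print(tree["ivl"])  # leaf labels in plotting order
```

The first row of Z merges GarageType and GarageYrBlt at distance zero, which corresponds to the "leaves linked at a distance of zero" case described above.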
The Challenges of Your Data
Now that we have gathered some basic information on your data, it’s a good idea to just go a little bit deeper into the challenges that the data might pose.
Two challenges come up in most EDA exercises: missing values and outliers.
For details on how to handle missing values, please visit https://www.kaggle.com/pavansanagapati/simple-tutorial-on-how-to-handle-missing-data
Box plots are used to detect outliers in a later part of this kernel.
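Both challenges can be quantified quickly before any plotting. The sketch below counts missing values per column and flags outliers with the 1.5 × IQR rule, which is the rule a standard box plot uses for its whiskers. The `toy` data and the `iqr_outliers` helper are hypothetical and only serve the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values and one extreme lot size.
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, np.nan, 75.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 10084, 10382, 215245],
})

# Missing values: keep only columns that actually have some, largest first.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

outliers = iqr_outliers(toy["LotArea"])

print(missing)
print(outliers)
```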
Estimate Skewness and Kurtosis
train.skew(numeric_only=True), train.kurt(numeric_only=True)
(Id 0.000000 MSSubClass 1.407657 LotFrontage 2.163569 LotArea 12.207688 OverallQual 0.216944 OverallCond 0.693067 YearBuilt -0.613461 YearRemodAdd -0.503562 MasVnrArea 2.669084 BsmtFinSF1 1.685503 BsmtFinSF2 4.255261 BsmtUnfSF 0.920268 TotalBsmtSF 1.524255 1stFlrSF 1.376757 2ndFlrSF 0.813030 LowQualFinSF 9.011341 GrLivArea 1.366560 BsmtFullBath 0.596067 BsmtHalfBath 4.103403 FullBath 0.036562 HalfBath 0.675897 BedroomAbvGr 0.211790 KitchenAbvGr 4.488397 TotRmsAbvGrd 0.676341 Fireplaces 0.649565 GarageYrBlt -0.649415 GarageCars -0.342549 GarageArea 0.179981 WoodDeckSF 1.541376 OpenPorchSF 2.364342 EnclosedPorch 3.089872 3SsnPorch 10.304342 ScreenPorch 4.122214 PoolArea 14.828374 MiscVal 24.476794 MoSold 0.212053 YrSold 0.096269 SalePrice 1.882876 dtype: float64, Id -1.200000 MSSubClass 1.580188 LotFrontage 17.452867 LotArea 203.243271 OverallQual 0.096293 OverallCond 1.106413 YearBuilt -0.439552 YearRemodAdd -1.272245 MasVnrArea 10.082417 BsmtFinSF1 11.118236 BsmtFinSF2 20.113338 BsmtUnfSF 0.474994 TotalBsmtSF 13.250483 1stFlrSF 5.745841 2ndFlrSF -0.553464 LowQualFinSF 83.234817 GrLivArea 4.895121 BsmtFullBath -0.839098 BsmtHalfBath 16.396642 FullBath -0.857043 HalfBath -1.076927 BedroomAbvGr 2.230875 KitchenAbvGr 21.532404 TotRmsAbvGrd 0.880762 Fireplaces -0.217237 GarageYrBlt -0.418341 GarageCars 0.220998 GarageArea 0.917067 WoodDeckSF 2.992951 OpenPorchSF 8.490336 EnclosedPorch 10.430766 3SsnPorch 123.662379 ScreenPorch 18.439068 PoolArea 223.268499 MiscVal 701.003342 MoSold -0.404109 YrSold -1.190601 SalePrice 6.536282 dtype: float64)
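When reading these numbers it helps to know which estimators pandas uses: skew() returns the bias-corrected sample skewness and kurt() returns the bias-corrected excess kurtosis, so a normal distribution scores about 0 on both. The sketch below checks this against scipy on a synthetic right-skewed sample; the sample is random data generated for the example, not the house price data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed sample (log-normal), for illustration only.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

pandas_skew = sample.skew()
scipy_skew = stats.skew(sample, bias=False)                    # bias-corrected skewness

pandas_kurt = sample.kurt()
scipy_kurt = stats.kurtosis(sample, fisher=True, bias=False)   # bias-corrected excess kurtosis

print(pandas_skew, scipy_skew)
print(pandas_kurt, scipy_kurt)
```

Positive skewness indicates a long right tail, and large positive excess kurtosis indicates heavy tails, which is the pattern visible for features such as LotArea, PoolArea, and MiscVal in the output above.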
y = train['SalePrice']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
It is apparent that SalePrice does not follow a normal distribution, so it should be transformed before performing regression. While a log transformation does a pretty good job, the best fit is the unbounded Johnson distribution.
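As a rough numeric check of why a log transform helps, the following sketch compares skewness before and after taking logs. It uses a synthetic right-skewed series as a stand-in for SalePrice; the data, the seed, and the variable names are assumptions for illustration, not values from this kernel.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for SalePrice (an assumption for illustration;
# these are not the real house prices).
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)), name="SalePrice")

skew_raw = prices.skew()          # positive for right-skewed data
skew_log = np.log(prices).skew()  # expected to be much closer to 0

print(f"skewness before log: {skew_raw:.3f}")
print(f"skewness after log:  {skew_log:.3f}")
```

A skewness near zero after the transform suggests the logged target is closer to symmetric, which is the property the regression step benefits from.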
sns.distplot(train.skew(numeric_only=True), color='blue', axlabel='Skewness')
plt.figure(figsize = (12,8))
sns.distplot(train.kurt(numeric_only=True), color='r', axlabel='Kurtosis', norm_hist=False, kde=True, rug=False)
#plt.hist(train.kurt(),orientation = 'vertical',histtype = 'bar',label ='Kurtosis', color ='blue')
plt.show()
plt.hist(train['SalePrice'],orientation = 'vertical',histtype = 'bar', color ='blue')
plt.show()
target = np.log(train['SalePrice'])
target.skew()
plt.hist(target,color='blue')
(array([ 5., 12., 54., 184., 470., 400., 220., 90., 19., 6.]),
array([10.46024211, 10.7676652 , 11.07508829, 11.38251138, 11.68993448,
11.99735757, 12.30478066, 12.61220375, 12.91962684, 13.22704994,
13.53447303]),
<a list of 10 Patch objects>)
Finding Correlation coefficients between numeric features and SalePrice
correlation = numeric_features.corr()
print(correlation['SalePrice'].sort_values(ascending = False),'\n')
SalePrice        1.000000
OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
ScreenPorch      0.111447
PoolArea         0.092404
MoSold           0.046432
3SsnPorch        0.044584
BsmtFinSF2      -0.011378
BsmtHalfBath    -0.016844
MiscVal         -0.021190
Id              -0.021917
LowQualFinSF    -0.025606
YrSold          -0.028923
OverallCond     -0.077856
MSSubClass      -0.084284
EnclosedPorch   -0.128578
KitchenAbvGr    -0.135907
Name: SalePrice, dtype: float64
To explore further we will start with the following visualisation methods to analyze the data better:
f , ax = plt.subplots(figsize = (14,12))
plt.title('Correlation of Numeric Features with Sale Price',y=1,size=16)
sns.heatmap(correlation,square = True, vmax=0.8)
A heatmap, which seaborn makes easy to draw, is one of the quickest ways to get an overview of correlated features.
At first glance, two red squares catch my attention.
Heatmaps are great for detecting this kind of multicollinearity, and in feature-selection problems like this project they are an excellent exploratory tool.
Another aspect I observed is the 'SalePrice' correlations. 'GrLivArea', 'TotalBsmtSF', and 'OverallQual' stand out with a big 'Hello!' to SalePrice, but we cannot ignore that the rest of the features also have some level of correlation with SalePrice. To look at these correlations more closely, let us zoom in on the heatmap.
k= 11
cols = correlation.nlargest(k,'SalePrice')['SalePrice'].index
print(cols)
cm = np.corrcoef(train[cols].values.T)
f , ax = plt.subplots(figsize = (14,12))
sns.heatmap(cm, vmax=.8, linewidths=0.01,square=True,annot=True,cmap='viridis',
linecolor="white",xticklabels = cols.values ,annot_kws = {'size':12},yticklabels = cols.values)
Index(['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
'TotalBsmtSF', '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt',
'YearRemodAdd'],
dtype='object')
From the zoomed heatmap above, GarageCars and GarageArea are closely correlated. Similarly, TotalBsmtSF and 1stFlrSF are also closely correlated.
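Reading correlated pairs off a heatmap by eye can be complemented with a programmatic listing. The sketch here uses a hypothetical helper, `correlated_pairs`, and synthetic stand-in data; the 0.8 threshold simply mirrors the `vmax` used for the heatmaps and is an assumption, not a rule from this kernel.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic stand-in data (invented values; only the column names echo the dataset).
rng = np.random.default_rng(0)
garage_cars = rng.integers(0, 4, size=500)
demo = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 40, size=500),
    "LotArea": rng.normal(10000, 2000, size=500),
})

result = correlated_pairs(demo, threshold=0.8)
print(result)
```

On the real data one would pass the numeric columns of `train`; pairs returned this way are candidates for dropping one of the two features.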
My observations :
sns.set()
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.pairplot(train[columns], height=2, kind='scatter', diag_kind='kde')
plt.show()
Although we already know some of the main figures, this pair plot gives a reasonable overview of the correlated features. Here are some of my observations.
One interesting observation concerns 'TotalBsmtSF' and 'GrLivArea'. In this panel the dots form a straight line that almost acts like a border. It makes sense that the majority of the dots stay below that line: a basement can be as large as the above-ground living area, but it is not expected to be larger.
Another interesting observation concerns 'SalePrice' and 'YearBuilt'. At the bottom of the 'dots cloud' we see what almost appears to be an exponential function. The same tendency is visible at the upper limit of the 'dots cloud'.
fig, ((ax1, ax2), (ax3, ax4),(ax5,ax6)) = plt.subplots(nrows=3, ncols=2, figsize=(14,10))
OverallQual_scatter_plot = pd.concat([train['SalePrice'],train['OverallQual']],axis = 1)
sns.regplot(x='OverallQual',y = 'SalePrice',data = OverallQual_scatter_plot,scatter= True, fit_reg=True, ax=ax1)
TotalBsmtSF_scatter_plot = pd.concat([train['SalePrice'],train['TotalBsmtSF']],axis = 1)
sns.regplot(x='TotalBsmtSF',y = 'SalePrice',data = TotalBsmtSF_scatter_plot,scatter= True, fit_reg=True, ax=ax2)
GrLivArea_scatter_plot = pd.concat([train['SalePrice'],train['GrLivArea']],axis = 1)
sns.regplot(x='GrLivArea',y = 'SalePrice',data = GrLivArea_scatter_plot,scatter= True, fit_reg=True, ax=ax3)
GarageArea_scatter_plot = pd.concat([train['SalePrice'],train['GarageArea']],axis = 1)
sns.regplot(x='GarageArea',y = 'SalePrice',data = GarageArea_scatter_plot,scatter= True, fit_reg=True, ax=ax4)
FullBath_scatter_plot = pd.concat([train['SalePrice'],train['FullBath']],axis = 1)
sns.regplot(x='FullBath',y = 'SalePrice',data = FullBath_scatter_plot,scatter= True, fit_reg=True, ax=ax5)
YearBuilt_scatter_plot = pd.concat([train['SalePrice'],train['YearBuilt']],axis = 1)
sns.regplot(x='YearBuilt',y = 'SalePrice',data = YearBuilt_scatter_plot,scatter= True, fit_reg=True, ax=ax6)
YearRemodAdd_scatter_plot = pd.concat([train['SalePrice'],train['YearRemodAdd']],axis = 1)
YearRemodAdd_scatter_plot.plot.scatter('YearRemodAdd','SalePrice')
saleprice_overall_quality= train.pivot_table(index ='OverallQual',values = 'SalePrice', aggfunc = np.median)
saleprice_overall_quality.plot(kind = 'bar',color = 'blue')
plt.xlabel('Overall Quality')
plt.ylabel('Median Sale Price')
plt.show()
var = 'OverallQual'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(12, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data,
showmeans = True,meanline=True,
palette="viridis",
meanprops = {'linewidth':2, 'color':'red'})
fig.axis(ymin=0, ymax=800000);
var = 'Neighborhood'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
xt = plt.xticks(rotation=45)
plt.figure(figsize = (12, 6))
sns.countplot(x = 'Neighborhood', data = data)
xt = plt.xticks(rotation=45)
Based on the above observation, we can group neighborhoods with similar housing prices into the same bucket for dimension reduction. Let us look at this in the preprocessing stage.
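One possible way to form such buckets is to bin neighborhoods by their median sale price. This is only a sketch: the choice of three quantile-based buckets, the labels, and the invented prices are all assumptions, and the grouping actually used in the preprocessing stage may differ.

```python
import pandas as pd

# Tiny made-up sample: the neighborhood codes exist in the dataset,
# but the prices here are invented for illustration.
demo = pd.DataFrame({
    "Neighborhood": ["MeadowV", "MeadowV", "IDOTRR", "IDOTRR", "NAmes", "NAmes",
                     "Gilbert", "Gilbert", "NridgHt", "NridgHt", "NoRidge", "NoRidge"],
    "SalePrice": [88000, 98000, 100000, 110000, 140000, 145000,
                  180000, 185000, 310000, 320000, 330000, 340000],
})

# Median price per neighborhood, then cut the medians into three equal-sized bins.
median_price = demo.groupby("Neighborhood")["SalePrice"].median()
bucket = pd.qcut(median_price, q=3, labels=["low", "mid", "high"])

# Map each row's neighborhood to its bucket.
demo["NeighborhoodBucket"] = demo["Neighborhood"].map(bucket)
print(demo[["Neighborhood", "NeighborhoodBucket"]].drop_duplicates())
```

Replacing 25 neighborhood levels with a handful of buckets reduces the number of dummy columns after encoding, at the cost of losing within-bucket differences.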
With qualitative variables, we can check the distribution of SalePrice with respect to each variable's values and enumerate them.
for c in categorical_features:
    train[c] = train[c].astype('category')
    if train[c].isnull().any():
        train[c] = train[c].cat.add_categories(['MISSING'])
        train[c] = train[c].fillna('MISSING')
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)
f = pd.melt(train, id_vars=['SalePrice'], value_vars=categorical_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "SalePrice")
var = 'SaleType'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
xt = plt.xticks(rotation=45)
var = 'SaleCondition'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 10))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
xt = plt.xticks(rotation=45)
sns.violinplot(x = 'Functional', y = 'SalePrice', data = train);
# factorplot was removed from seaborn; catplot with kind='point' is its replacement
sns.catplot(x='FireplaceQu', y='SalePrice', data=train, kind='point', color='m',
            estimator=np.median, order=['Ex', 'Gd', 'TA', 'Fa', 'Po'], height=4.5, aspect=1.35)
g = sns.FacetGrid(train, col = 'FireplaceQu', col_wrap = 3, col_order=['Ex', 'Gd', 'TA', 'Fa', 'Po'])
g.map(sns.boxplot, 'Fireplaces', 'SalePrice', order = [1, 2, 3], palette = 'Set2')
plt.figure(figsize=(8,10))
g1 = sns.pointplot(x='Neighborhood', y='SalePrice',
data=train, hue='LotShape')
g1.set_xticklabels(g1.get_xticklabels(),rotation=90)
g1.set_title("Lotshape Based on Neighborhood", fontsize=15)
g1.set_xlabel("Neighborhood")
g1.set_ylabel("Sale Price", fontsize=12)
plt.show()
We will first check the percentage of missing values present in each feature
data = pd.read_csv("train1.csv")
features_with_na = [feature for feature in data.columns if data[feature].isnull().sum() > 1]
for feature in features_with_na:
    # isnull().mean() is a fraction, so scale by 100 to report a true percentage
    print(feature, np.round(data[feature].isnull().mean() * 100, 2), ' % of Missing Values')
LotFrontage 17.74 % of Missing Values
Alley 93.77 % of Missing Values
MasVnrType 0.55 % of Missing Values
MasVnrArea 0.55 % of Missing Values
BsmtQual 2.53 % of Missing Values
BsmtCond 2.53 % of Missing Values
BsmtExposure 2.6 % of Missing Values
BsmtFinType1 2.53 % of Missing Values
BsmtFinType2 2.6 % of Missing Values
FireplaceQu 47.26 % of Missing Values
GarageType 5.55 % of Missing Values
GarageYrBlt 5.55 % of Missing Values
GarageFinish 5.55 % of Missing Values
GarageQual 5.55 % of Missing Values
GarageCond 5.55 % of Missing Values
PoolQC 99.52 % of Missing Values
Fence 80.75 % of Missing Values
MiscFeature 96.3 % of Missing Values
for feature in features_with_na:
    dataset = data.copy()
    # 1 where the value is missing, 0 where it is present
    dataset[feature] = np.where(dataset[feature].isnull(), 1, 0)
    # Compare the median SalePrice where the information is missing or present
    dataset.groupby(feature)['SalePrice'].median().plot.bar()
    plt.title(feature)
    plt.show()
total = numeric_features.isnull().sum().sort_values(ascending=False)
percent = (numeric_features.isnull().sum()/numeric_features.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1,join='outer', keys=['Total Missing Count', '% of Total Observations'])
missing_data.index.name =' Numeric Feature'
missing_data.head(20)
| Numeric Feature | Total Missing Count | % of Total Observations |
|---|---|---|
| LotFrontage | 259 | 0.177397 |
| GarageYrBlt | 81 | 0.055479 |
| MasVnrArea | 8 | 0.005479 |
| Id | 0 | 0.000000 |
| OpenPorchSF | 0 | 0.000000 |
| KitchenAbvGr | 0 | 0.000000 |
| TotRmsAbvGrd | 0 | 0.000000 |
| Fireplaces | 0 | 0.000000 |
| GarageCars | 0 | 0.000000 |
| GarageArea | 0 | 0.000000 |
| WoodDeckSF | 0 | 0.000000 |
| EnclosedPorch | 0 | 0.000000 |
| HalfBath | 0 | 0.000000 |
| 3SsnPorch | 0 | 0.000000 |
| ScreenPorch | 0 | 0.000000 |
| PoolArea | 0 | 0.000000 |
| MiscVal | 0 | 0.000000 |
| MoSold | 0 | 0.000000 |
| YrSold | 0 | 0.000000 |
| BedroomAbvGr | 0 | 0.000000 |
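The count-and-share table is built separately for numeric and categorical features in this section, and the share column holds fractions (for example 0.177397) under a "%" heading. A small helper could produce the same summary for any DataFrame and report true percentages. The function name `missing_summary`, its column names, and the demo data are hypothetical, introduced only for this sketch.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage (0-100) of missing values per column, largest first."""
    total = df.isnull().sum()
    percent = 100 * df.isnull().mean()  # mean of a boolean mask is the missing fraction
    summary = pd.DataFrame({"missing_count": total,
                            "missing_percent": percent.round(2)})
    # Keep only columns that actually have missing values.
    return summary[summary["missing_count"] > 0].sort_values("missing_count", ascending=False)

# Made-up rows; only the column names echo the dataset.
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

summary = missing_summary(demo)
print(summary)
```

The same call would work on the numeric and categorical subsets alike, which avoids repeating the `total`/`percent`/`concat` steps.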
missing_values = numeric_features.isnull().sum(axis=0).reset_index()
missing_values.columns = ['column_name', 'missing_count']
missing_values = missing_values.loc[missing_values['missing_count']>0]
missing_values = missing_values.sort_values(by='missing_count')
ind = np.arange(missing_values.shape[0])
width = 0.1
fig, ax = plt.subplots(figsize=(12,3))
rects = ax.barh(ind, missing_values.missing_count.values, color='b')
ax.set_yticks(ind)
ax.set_yticklabels(missing_values.column_name.values, rotation='horizontal')
ax.set_xlabel("Missing Observations Count")
ax.set_title("Missing Observations Count - Numeric Features")
plt.show()
Let us look at the missing values in categorical features in detail
total = categorical_features.isnull().sum().sort_values(ascending=False)
percent = (categorical_features.isnull().sum()/categorical_features.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1,join='outer', keys=['Total Missing Count', ' % of Total Observations'])
missing_data.index.name ='Feature'
missing_data.head(20)
| Feature | Total Missing Count | % of Total Observations |
|---|---|---|
| PoolQC | 1453 | 0.995205 |
| MiscFeature | 1406 | 0.963014 |
| Alley | 1369 | 0.937671 |
| Fence | 1179 | 0.807534 |
| FireplaceQu | 690 | 0.472603 |
| GarageType | 81 | 0.055479 |
| GarageCond | 81 | 0.055479 |
| GarageQual | 81 | 0.055479 |
| GarageFinish | 81 | 0.055479 |
| BsmtFinType2 | 38 | 0.026027 |
| BsmtExposure | 38 | 0.026027 |
| BsmtFinType1 | 37 | 0.025342 |
| BsmtQual | 37 | 0.025342 |
| BsmtCond | 37 | 0.025342 |
| MasVnrType | 8 | 0.005479 |
| Electrical | 1 | 0.000685 |
| Functional | 0 | 0.000000 |
| KitchenQual | 0 | 0.000000 |
| CentralAir | 0 | 0.000000 |
| HeatingQC | 0 | 0.000000 |
missing_values = categorical_features.isnull().sum(axis=0).reset_index()
missing_values.columns = ['column_name', 'missing_count']
missing_values = missing_values.loc[missing_values['missing_count']>0]
missing_values = missing_values.sort_values(by='missing_count')
ind = np.arange(missing_values.shape[0])
width = 0.9
fig, ax = plt.subplots(figsize=(12,18))
rects = ax.barh(ind, missing_values.missing_count.values, color='red')
ax.set_yticks(ind)
ax.set_yticklabels(missing_values.column_name.values, rotation='horizontal')
ax.set_xlabel("Missing Observations Count")
ax.set_title("Missing Observations Count - Categorical Features")
plt.show()
Let us look at the unique values in categorical features in both train and test dataframes in detail
for column_name in train.columns:
    if train[column_name].dtypes == 'object':
        train[column_name] = train[column_name].fillna(train[column_name].mode().iloc[0])
        unique_category = len(train[column_name].unique())
        print("Feature '{column_name}' has '{unique_category}' unique categories".format(column_name = column_name,
                                                                                         unique_category=unique_category))
for column_name in test.columns:
    if test[column_name].dtypes == 'object':
        test[column_name] = test[column_name].fillna(test[column_name].mode().iloc[0])
        unique_category = len(test[column_name].unique())
        print("Features in test set '{column_name}' has '{unique_category}' unique categories".format(column_name = column_name, unique_category=unique_category))
Features in test set 'MSZoning' has '5' unique categories
Features in test set 'Street' has '2' unique categories
Features in test set 'Alley' has '2' unique categories
Features in test set 'LotShape' has '4' unique categories
Features in test set 'LandContour' has '4' unique categories
Features in test set 'Utilities' has '1' unique categories
Features in test set 'LotConfig' has '5' unique categories
Features in test set 'LandSlope' has '3' unique categories
Features in test set 'Neighborhood' has '25' unique categories
Features in test set 'Condition1' has '9' unique categories
Features in test set 'Condition2' has '5' unique categories
Features in test set 'BldgType' has '5' unique categories
Features in test set 'HouseStyle' has '7' unique categories
Features in test set 'RoofStyle' has '6' unique categories
Features in test set 'RoofMatl' has '4' unique categories
Features in test set 'Exterior1st' has '13' unique categories
Features in test set 'Exterior2nd' has '15' unique categories
Features in test set 'MasVnrType' has '4' unique categories
Features in test set 'ExterQual' has '4' unique categories
Features in test set 'ExterCond' has '5' unique categories
Features in test set 'Foundation' has '6' unique categories
Features in test set 'BsmtQual' has '4' unique categories
Features in test set 'BsmtCond' has '4' unique categories
Features in test set 'BsmtExposure' has '4' unique categories
Features in test set 'BsmtFinType1' has '6' unique categories
Features in test set 'BsmtFinType2' has '6' unique categories
Features in test set 'Heating' has '4' unique categories
Features in test set 'HeatingQC' has '5' unique categories
Features in test set 'CentralAir' has '2' unique categories
Features in test set 'Electrical' has '4' unique categories
Features in test set 'KitchenQual' has '4' unique categories
Features in test set 'Functional' has '7' unique categories
Features in test set 'FireplaceQu' has '5' unique categories
Features in test set 'GarageType' has '6' unique categories
Features in test set 'GarageFinish' has '3' unique categories
Features in test set 'GarageQual' has '4' unique categories
Features in test set 'GarageCond' has '5' unique categories
Features in test set 'PavedDrive' has '3' unique categories
Features in test set 'PoolQC' has '2' unique categories
Features in test set 'Fence' has '4' unique categories
Features in test set 'MiscFeature' has '3' unique categories
Features in test set 'SaleType' has '9' unique categories
Features in test set 'SaleCondition' has '6' unique categories
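Counting unique categories shows how many levels each feature has, but not whether train and test share the same levels. For instance, 'Utilities' has a single category in the test output above. A sketch of a direct comparison follows; `category_mismatch` is a hypothetical helper, and the two small frames are made up for illustration.

```python
import pandas as pd

def category_mismatch(train_df, test_df):
    """For each shared non-numeric column, list levels seen in only one frame."""
    report = {}
    for col in train_df.columns.intersection(test_df.columns):
        if pd.api.types.is_numeric_dtype(train_df[col]):
            continue  # only compare non-numeric (categorical) columns
        train_levels = set(train_df[col].dropna().unique())
        test_levels = set(test_df[col].dropna().unique())
        if train_levels != test_levels:
            report[col] = {"train_only": sorted(train_levels - test_levels),
                           "test_only": sorted(test_levels - train_levels)}
    return report

# Made-up frames; the level codes mimic the dataset's style.
train_demo = pd.DataFrame({"Utilities": ["AllPub", "NoSeWa", "AllPub"],
                           "Street": ["Pave", "Grvl", "Pave"],
                           "LotArea": [8450, 9600, 11250]})
test_demo = pd.DataFrame({"Utilities": ["AllPub", "AllPub", "AllPub"],
                          "Street": ["Pave", "Grvl", "Pave"],
                          "LotArea": [11622, 14267, 13830]})

report = category_mismatch(train_demo, test_demo)
print(report)
```

Levels that appear in only one frame matter for encoding: a one-hot encoder fitted on train alone would produce columns that the test set never populates, or miss levels that only the test set contains.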
for feature in continuous_feature:
    data=train.copy()
    if 0 in data[feature].unique():
        pass
    else:
        data[feature]=np.log(data[feature])
        data.boxplot(column=feature)
        plt.ylabel(feature)
        plt.title(feature)
        plt.show()
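The box plots from this loop flag outliers visually. The same whisker rule that matplotlib box plots use by default (1.5 times the interquartile range beyond the quartiles) can also be applied numerically to list the flagged rows. The helper name `iqr_outlier_mask` and the sample values are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box plot whisker rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up living areas with one obviously extreme value.
demo = pd.Series([1200, 1350, 1400, 1500, 1550, 1600, 1700, 1800, 5600], name="GrLivArea")

# Apply the rule on the log scale, matching the log transform used before plotting.
mask = iqr_outlier_mask(np.log(demo))
print(demo[mask])
```

Whether a flagged point should be removed is a separate judgment; the mask only identifies candidates for a closer look.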
for feature in categorical_features:
    data=train.copy()
    data.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()